:orphan:

Sklearn Basics 3: Train a Classifier on a Snowflake Multi-Table Dataset
=======================================================================

In this notebook, we will learn how to train a classifier on more
complex multi-table data, in which a secondary table is itself the
parent of another table (i.e. a snowflake schema). We highly recommend
going through the *Sklearn Basics 1* and *Sklearn Basics 2* lessons
first if you are not familiar with Khiops’ sklearn estimators.

We start by importing the sklearn estimator ``KhiopsClassifier``:

.. code:: ipython3

    import os
    import pandas as pd
    from khiops import core as kh
    from khiops.sklearn import KhiopsClassifier, train_test_split_dataset
    from sklearn import metrics
    
    # If there are any issues, you may check the Khiops status with the following command
    # kh.get_runner().print_status()

Training a Multi-Table Classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We’ll train a multi-table classifier on an extension of the
``AccidentsSummary`` dataset that we used in the previous notebook,
*Sklearn Basics 2*. This dataset, ``Accidents``, contains the additional
table ``Users`` and is organized in the following relational snowflake
schema.

::

   Accidents
   |
   | -- 1:n -- Vehicles
   |              |
   |              |-- 1:n -- Users
   |              
   | -- 1:1 -- Places              

Note that the target variable is ``Gravity``.

To train the KhiopsClassifier for this setup, we must specify a
multi-table dataset. Let’s first check the content of the tables:

-  The main table ``Accidents``.
-  The first secondary table ``Vehicles`` which has a ``1:n``
   relationship with ``Accidents``.
-  The second secondary table ``Places`` which has a ``1:1``
   relationship with ``Accidents``.
-  The tertiary table ``Users`` which has a ``1:n`` relationship with
   ``Vehicles``.

.. code:: ipython3

    accidents_dataset_dir = os.path.join(kh.get_samples_dir(), "Accidents")
    
    accidents_file = os.path.join(accidents_dataset_dir, "Accidents.txt")
    accidents_df = pd.read_csv(accidents_file, sep="\t")
    print("Accident dataframe (first 10 rows):")
    display(accidents_df.head(10))
    print()
    
    vehicles_file = os.path.join(accidents_dataset_dir, "Vehicles.txt")
    vehicles_df = pd.read_csv(vehicles_file, sep="\t")
    print("Vehicle dataframe (first 10 rows):")
    display(vehicles_df.head(10))
    
    # Load the tertiary table Users
    users_file = os.path.join(accidents_dataset_dir, "Users.txt")
    users_df = pd.read_csv(users_file, sep="\t")
    print("User dataframe (first 10 rows):")
    display(users_df.head(10))
    print()
    
    places_file = os.path.join(accidents_dataset_dir, "Places.txt")
    places_df = pd.read_csv(places_file, sep="\t", low_memory=False)
    print("Places dataframe (first 10 rows):")
    display(places_df.head(10))


.. parsed-literal::

    Accident dataframe (first 10 rows):


.. parsed-literal::

         AccidentId    Gravity        Date      Hour               Light  \
    0  201800000001  NonLethal  2018-01-24  15:05:00            Daylight   
    1  201800000002  NonLethal  2018-02-12  10:15:00            Daylight   
    2  201800000003  NonLethal  2018-03-04  11:35:00            Daylight   
    3  201800000004  NonLethal  2018-05-05  17:35:00            Daylight   
    4  201800000005  NonLethal  2018-06-26  16:05:00            Daylight   
    5  201800000006  NonLethal  2018-09-23  06:30:00      TwilightOrDawn   
    6  201800000007  NonLethal  2018-09-26  00:40:00  NightStreelightsOn   
    7  201800000008     Lethal  2018-11-30  17:15:00  NightStreelightsOn   
    8  201800000009  NonLethal  2018-02-18  15:57:00            Daylight   
    9  201800000010  NonLethal  2018-03-19  15:30:00            Daylight   
    
       Department  Commune InAgglomeration IntersectionType    Weather  \
    0         590        5              No           Y-type     Normal   
    1         590       11             Yes           Square   VeryGood   
    2         590      477             Yes           T-type     Normal   
    3         590       52             Yes   NoIntersection   VeryGood   
    4         590      477             Yes   NoIntersection     Normal   
    5         590       52             Yes   NoIntersection  LightRain   
    6         590      133             Yes   NoIntersection     Normal   
    7         590       11             Yes   NoIntersection     Normal   
    8         590      550              No   NoIntersection     Normal   
    9         590       51             Yes           X-type     Normal   
    
                          CollisionType             PostalAddress GPSCode  \
    0  2Vehicles-BehindVehicles-Frontal    route des Ansereuilles       M   
    1                       NoCollision  Place du général de Gaul       M   
    2                       NoCollision            Rue  nationale       M   
    3                    2Vehicles-Side       30 rue Jules Guesde       M   
    4                    2Vehicles-Side        72 rue Victor Hugo       M   
    5                             Other                       D39       M   
    6                             Other        4 route de camphin       M   
    7                             Other         rue saint exupéry       M   
    8                             Other          rue de l'égalité       M   
    9  2Vehicles-BehindVehicles-Frontal   face au 59 rue de Lille       M   
    
       Latitude  Longitude  
    0  50.55737    2.55737  
    1  50.52936    2.52936  
    2  50.51243    2.51243  
    3  50.51974    2.51974  
    4  50.51607    2.51607  
    5  50.52132    2.52132  
    6  50.52211    2.52211  
    7  50.53146    2.53146  
    8  50.53707    2.53707  
    9  50.53639    2.53639  


.. parsed-literal::

    
    Vehicle dataframe (first 10 rows):


.. parsed-literal::

         AccidentId VehicleId Direction          Category  PassengerNumber  \
    0  201800000001       A01   Unknown         Car<=3.5T                0   
    1  201800000001       B01   Unknown         Car<=3.5T                0   
    2  201800000002       A01   Unknown         Car<=3.5T                0   
    3  201800000003       A01   Unknown  Motorbike>125cm3                0   
    4  201800000003       B01   Unknown         Car<=3.5T                0   
    5  201800000003       C01   Unknown         Car<=3.5T                0   
    6  201800000004       A01   Unknown         Car<=3.5T                0   
    7  201800000004       B01   Unknown           Bicycle                0   
    8  201800000005       A01   Unknown             Moped                0   
    9  201800000005       B01   Unknown         Car<=3.5T                0   
    
           FixedObstacle MobileObstacle ImpactPoint           Maneuver  
    0                NaN        Vehicle  RightFront         TurnToLeft  
    1                NaN        Vehicle   LeftFront  NoDirectionChange  
    2                NaN     Pedestrian         NaN  NoDirectionChange  
    3  StationaryVehicle        Vehicle       Front  NoDirectionChange  
    4                NaN        Vehicle    LeftSide         TurnToLeft  
    5                NaN            NaN   RightSide             Parked  
    6                NaN          Other  RightFront          Avoidance  
    7                NaN        Vehicle    LeftSide                NaN  
    8                NaN        Vehicle  RightFront           PassLeft  
    9                NaN        Vehicle   LeftFront               Park  


.. parsed-literal::

    User dataframe (first 10 rows):


.. parsed-literal::

         AccidentId VehicleId  Seat    Category Gender TripReason    SafetyDevice  \
    0  201800000001       A01   1.0      Driver   Male    Leisure        SeatBelt   
    1  201800000001       B01   1.0      Driver   Male        NaN        SeatBelt   
    2  201800000002       A01   1.0      Driver   Male        NaN        SeatBelt   
    3  201800000002       A01   NaN  Pedestrian   Male        NaN          Helmet   
    4  201800000003       A01   1.0      Driver   Male    Leisure          Helmet   
    5  201800000003       C01   1.0      Driver   Male        NaN  ChildrenDevice   
    6  201800000004       A01   1.0      Driver   Male    Leisure        SeatBelt   
    7  201800000004       B01   1.0      Driver   Male    Leisure          Helmet   
    8  201800000005       A01   1.0      Driver   Male    Leisure          Helmet   
    9  201800000005       B01   1.0      Driver   Male    Leisure        SeatBelt   
    
      SafetyDeviceUsed            PedestrianLocation PedestrianAction  \
    0              Yes                           NaN              NaN   
    1              Yes                           NaN              NaN   
    2              Yes                           NaN              NaN   
    3              NaN  OnLane<=OnSidewalk0mCrossing         Crossing   
    4              Yes                           NaN              NaN   
    5              NaN                           NaN              NaN   
    6              Yes                           NaN              NaN   
    7              NaN                           NaN              NaN   
    8              Yes                           NaN              NaN   
    9              Yes                           NaN              NaN   
    
      PedestrianCompany  BirthYear  
    0           Unknown     1960.0  
    1           Unknown     1928.0  
    2           Unknown     1947.0  
    3             Alone     1959.0  
    4           Unknown     1987.0  
    5           Unknown     1977.0  
    6           Unknown     1982.0  
    7           Unknown     2013.0  
    8           Unknown     2001.0  
    9           Unknown     1946.0  


.. parsed-literal::

    
    Places dataframe (first 10 rows):


.. parsed-literal::

         AccidentId       RoadType RoadNumber  RoadSecNumber RoadLetter  \
    0  201800000001  Departamental         41            NaN          C   
    1  201800000002       Communal         41            NaN          D   
    2  201800000003  Departamental         39            NaN          D   
    3  201800000004  Departamental         39            NaN        NaN   
    4  201800000005       Communal        NaN            NaN        NaN   
    5  201800000006  Departamental         39            NaN          D   
    6  201800000007  Departamental         41            NaN          D   
    7  201800000008       Communal          -            NaN        NaN   
    8  201800000009  Departamental        141            NaN          D   
    9  201800000010  Departamental        641            NaN        NaN   
    
      Circulation  LaneNumber SpecialLane   Slope  RoadMarkerId  \
    0      TwoWay         2.0           0    Flat           NaN   
    1      TwoWay         2.0           0    Flat           NaN   
    2      TwoWay         2.0           0    Flat           NaN   
    3      TwoWay         2.0           0    Flat           NaN   
    4      OneWay         1.0           0    Flat           NaN   
    5     Unknown         2.0           0  Uphill           NaN   
    6      TwoWay         2.0           0    Flat          16.0   
    7      TwoWay         2.0           0    Flat           NaN   
    8      TwoWay         2.0           0    Flat           NaN   
    9      TwoWay         2.0        Bike    Flat           1.0   
    
       RoadMarkerDistance      Layout  StripWidth  LaneWidth SurfaceCondition  \
    0                 NaN  RightCurve         NaN        NaN           Normal   
    1                 NaN   LeftCurve         NaN        NaN           Normal   
    2                 NaN    Straight         NaN        NaN           Normal   
    3                 NaN    Straight         NaN        NaN           Normal   
    4                 NaN    Straight         NaN        NaN           Normal   
    5                 NaN   LeftCurve         NaN        NaN              Wet   
    6               500.0    Straight         NaN        NaN           Normal   
    7                 NaN    Straight         NaN        NaN           Normal   
    8                 NaN    Straight         NaN        NaN           Normal   
    9               670.0    Straight         NaN        NaN           Normal   
    
      Infrastructure Localization  SchoolNear  
    0        Unknown         Lane         0.0  
    1        Unknown         Lane         0.0  
    2        Unknown         Lane         0.0  
    3        Unknown         Lane         0.0  
    4        Unknown         Lane         0.0  
    5        Unknown     Shoulder         0.0  
    6        Unknown     Shoulder         0.0  
    7        Unknown         Lane         0.0  
    8        Unknown     Shoulder         0.0  
    9        Unknown         Lane         0.0  


Create the multi-table dataset specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that the main table ``Accidents`` and the secondary table
``Places`` each have a single key column, ``AccidentId``. The tables
``Vehicles`` (the other secondary table) and ``Users`` (the tertiary
table) have two key columns: ``AccidentId``
and ``VehicleId``.
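
The key structure described above can be checked before building the
dataset spec. The following is only a minimal sketch on made-up toy rows
that mimic the key columns of the sample tables; the helper names
``key_values``, ``max_rows_per_key`` and ``has_orphans`` are hypothetical
and are not part of Khiops.

.. code:: ipython3

    from collections import Counter

    # Toy key columns only; the values are made up for illustration
    accidents = [{"AccidentId": 1}, {"AccidentId": 2}]
    places = [{"AccidentId": 1}, {"AccidentId": 2}]
    vehicles = [
        {"AccidentId": 1, "VehicleId": "A01"},
        {"AccidentId": 1, "VehicleId": "B01"},
        {"AccidentId": 2, "VehicleId": "A01"},
    ]
    users = [
        {"AccidentId": 1, "VehicleId": "A01"},
        {"AccidentId": 1, "VehicleId": "A01"},
        {"AccidentId": 2, "VehicleId": "A01"},
    ]


    def key_values(rows, key_columns):
        """Extract the key of each row as a tuple (hypothetical helper)."""
        return [tuple(row[column] for column in key_columns) for row in rows]


    def max_rows_per_key(rows, key_columns):
        """Return the largest number of rows that share one key value."""
        return max(Counter(key_values(rows, key_columns)).values())


    def has_orphans(child_rows, parent_rows, parent_key):
        """Return True if a child row refers to a parent key that does not exist."""
        parent_keys = set(key_values(parent_rows, parent_key))
        return any(key not in parent_keys for key in key_values(child_rows, parent_key))


    # 1:1 relation: at most one Places row per accident
    places_per_accident = max_rows_per_key(places, ["AccidentId"])
    # 1:n relations: several Vehicles per accident, several Users per vehicle
    vehicles_per_accident = max_rows_per_key(vehicles, ["AccidentId"])
    users_per_vehicle = max_rows_per_key(users, ["AccidentId", "VehicleId"])
    orphan_users = has_orphans(users, vehicles, ["AccidentId", "VehicleId"])
    print(places_per_accident, vehicles_per_accident, users_per_vehicle, orphan_users)

On the real dataframes the same idea could be applied with pandas, for
instance by counting duplicated key values, but that variant is not shown
here.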

To describe the relations between tables, we must add the ``relations``
field to the dataset spec. This field contains a list of tuples, one per
relation. The first two values (``str``) of each tuple are the names of
the parent table and the child table of the relation, in that order. An
optional third value (``bool``) can be set to ``True`` to indicate that
the relation is ``1:1``. For example, if the tuple
``(table1, table2, True)`` is contained in this field, it means that:

-  ``table1`` and ``table2`` are in a ``1:1`` relationship
-  The key of ``table1`` is contained in that of ``table2`` (i.e. the
   keys are hierarchical)

If the ``relations`` field is not present then Khiops Python assumes
that the tables are in a *star* schema with ``main_table`` as the
central table.
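
To make these rules concrete, here is a small sketch that works on a
dataset spec whose dataframes are replaced by ``None`` placeholders. The
functions ``resolve_relations`` and ``check_hierarchical_keys`` are
hypothetical helpers written for this illustration. They only mimic the
behavior described above and are not the checks that Khiops Python
performs internally.

.. code:: ipython3

    def as_key_list(key):
        """Normalize a key given as one column name or as a list of names."""
        return [key] if isinstance(key, str) else list(key)


    def resolve_relations(spec):
        """Return the explicit relations, or star-schema relations if absent."""
        if "relations" in spec:
            return [tuple(relation) for relation in spec["relations"]]
        main_table = spec["main_table"]
        return [(main_table, name) for name in spec["tables"] if name != main_table]


    def check_hierarchical_keys(spec):
        """List the relations whose parent key is not contained in the child key."""
        problems = []
        for relation in resolve_relations(spec):
            parent, child = relation[0], relation[1]
            parent_key = as_key_list(spec["tables"][parent][1])
            child_key = as_key_list(spec["tables"][child][1])
            if not set(parent_key) <= set(child_key):
                problems.append((parent, child))
        return problems


    # Same layout as the spec of this notebook, with None instead of dataframes
    toy_spec = {
        "main_table": "Accidents",
        "tables": {
            "Accidents": (None, "AccidentId"),
            "Vehicles": (None, ["AccidentId", "VehicleId"]),
            "Users": (None, ["AccidentId", "VehicleId"]),
            "Places": (None, "AccidentId"),
        },
        "relations": [
            ("Accidents", "Vehicles"),
            ("Vehicles", "Users"),
            ("Accidents", "Places", True),
        ],
    }
    toy_star_spec = {k: v for k, v in toy_spec.items() if k != "relations"}

    print(check_hierarchical_keys(toy_spec))
    print(resolve_relations(toy_star_spec))

Without the ``relations`` field, the sketch attaches ``Users`` directly
to ``Accidents``, which shows why the field is needed to express the
``Vehicles`` to ``Users`` link of the snowflake schema.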

.. code:: ipython3

    X_accidents = {
        "main_table": "Accidents",
        "tables": {
            "Accidents": (accidents_df.drop("Gravity", axis=1), "AccidentId"),
            "Vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
            "Users": (users_df, ["AccidentId", "VehicleId"]),
            "Places": (places_df, "AccidentId"),
        },
        "relations": [
            ("Accidents", "Vehicles"),
            ("Vehicles", "Users"),
            ("Accidents", "Places", True),
        ],
    }
    y_accidents = accidents_df["Gravity"]

Split the dataset into train and test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We use the helper function ``train_test_split_dataset`` with the ``X``
dataset spec to obtain one spec for train and another for test.

.. code:: ipython3

    (
        X_accidents_train,
        X_accidents_test,
        y_accidents_train,
        y_accidents_test,
    ) = train_test_split_dataset(X_accidents, y_accidents, test_size=0.3)
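
When a multi-table dataset is split, the rows of the secondary tables
have to follow their parent rows; otherwise a test accident could keep
vehicles that ended up in the train set. The sketch below illustrates
this idea on toy rows. It is an assumption about the general principle,
not the actual implementation of ``train_test_split_dataset``, and the
function ``split_by_main_key`` is a hypothetical name.

.. code:: ipython3

    import random


    def split_by_main_key(main_rows, child_rows, key, test_size=0.3, seed=0):
        """Split the main table by key and send child rows with their parent."""
        keys = sorted({row[key] for row in main_rows})
        random.Random(seed).shuffle(keys)
        n_test = round(len(keys) * test_size)
        test_keys = set(keys[:n_test])

        def pick(rows, in_test):
            # Keep the rows whose key is (or is not) in the test part
            return [row for row in rows if (row[key] in test_keys) == in_test]

        return (
            pick(main_rows, False),
            pick(main_rows, True),
            pick(child_rows, False),
            pick(child_rows, True),
        )


    # Toy data: 10 accidents with 2 vehicles each
    toy_accidents = [{"AccidentId": i} for i in range(10)]
    toy_vehicles = [
        {"AccidentId": i, "VehicleId": vehicle_id}
        for i in range(10)
        for vehicle_id in ("A01", "B01")
    ]

    acc_train, acc_test, veh_train, veh_test = split_by_main_key(
        toy_accidents, toy_vehicles, "AccidentId", test_size=0.3
    )
    print(len(acc_train), len(acc_test), len(veh_train), len(veh_test))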

Train a classifier with this dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-  You may choose the number of features ``n_features`` to be created by
   the Khiops AutoML engine
-  Set the number of trees to zero (``n_trees=0``)

.. code:: ipython3

    khc_accidents = KhiopsClassifier(n_trees=0, n_features=1000)
    khc_accidents.fit(X_accidents_train, y_accidents_train)


.. parsed-literal::

    KhiopsClassifier(n_features=1000, n_trees=0)


Print the train accuracy and train AUC of the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    accidents_train_performance = (
        khc_accidents.model_report_.train_evaluation_report.get_snb_performance()
    )
    print(f"Accidents train accuracy: {accidents_train_performance.accuracy}")
    print(f"Accidents train auc     : {accidents_train_performance.auc}")


.. parsed-literal::

    Accidents train accuracy: 0.94598
    Accidents train auc     : 0.848822


Deploy the classifier to obtain predictions and probabilities on the test data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    y_accidents_test_predicted = khc_accidents.predict(X_accidents_test)
    probas_accidents_test = khc_accidents.predict_proba(X_accidents_test)
    
    print("Accidents test predictions (first 10 values):")
    display(y_accidents_test_predicted[:10])
    print("Accidents test prediction probabilities (first 10 values):")
    display(probas_accidents_test[:10])


.. parsed-literal::

    Accidents test predictions (first 10 values):


.. parsed-literal::

    array(['NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal',
           'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal'],
          dtype=object)


.. parsed-literal::

    Accidents test prediction probabilities (first 10 values):


.. parsed-literal::

    array([[0.12642066, 0.87357934],
           [0.00320008, 0.99679992],
           [0.08678393, 0.91321607],
           [0.01330569, 0.98669431],
           [0.01935034, 0.98064966],
           [0.04561496, 0.95438504],
           [0.01197766, 0.98802234],
           [0.06105229, 0.93894771],
           [0.31463115, 0.68536885],
           [0.00336539, 0.99663461]])


Estimate the accuracy and AUC metrics on the test data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    accidents_test_accuracy = metrics.accuracy_score(
        y_accidents_test, y_accidents_test_predicted
    )
    accidents_test_auc = metrics.roc_auc_score(
        y_accidents_test, probas_accidents_test[:, 1]
    )
    
    print(f"Accidents test accuracy: {accidents_test_accuracy}")
    print(f"Accidents test auc     : {accidents_test_auc}")


.. parsed-literal::

    Accidents test accuracy: 0.9428324199596193
    Accidents test auc     : 0.8267824928889166
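
To see what these two metrics measure, the following sketch recomputes
them by hand on made-up labels and scores. The functions ``accuracy`` and
``pairwise_auc`` are hypothetical helpers, not scikit-learn or Khiops
functions. The scores play the role of the second probability column
used above, which in the printed output is the one that is high whenever
``NonLethal`` is predicted.

.. code:: ipython3

    def accuracy(y_true, y_pred):
        """Fraction of predictions that match the true labels."""
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


    def pairwise_auc(y_true, scores, positive):
        """Probability that a positive example is scored above a negative one.

        Ties between a positive and a negative score count for one half.
        """
        pos = [s for t, s in zip(y_true, scores) if t == positive]
        neg = [s for t, s in zip(y_true, scores) if t != positive]
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))


    # Made-up labels and scores for the "NonLethal" class
    toy_y_true = ["NonLethal", "NonLethal", "Lethal", "NonLethal", "Lethal"]
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.2]
    toy_y_pred = ["NonLethal" if s >= 0.5 else "Lethal" for s in toy_scores]

    toy_accuracy = accuracy(toy_y_true, toy_y_pred)
    toy_auc = pairwise_auc(toy_y_true, toy_scores, positive="NonLethal")
    print(f"Toy accuracy: {toy_accuracy}")
    print(f"Toy auc     : {toy_auc}")

The accuracy depends on the decision threshold, whereas the AUC only
depends on how the scores rank the examples. This is why both are
reported: on an imbalanced target such as ``Gravity``, a high accuracy
can be reached by predicting the majority class most of the time.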